pip install scikit-learn==1.2.2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
conda install nbformat
data = pd.read_csv('heart_2022_no_nans.csv')
#data.drop(columns=['State'], inplace=True)
data
#data.head(10)
| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Very good | 4.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | ... | 1.60 | 71.67 | 27.99 | No | No | Yes | Yes | Yes, received Tdap | No | No |
| 1 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 6.0 | None of them | No | ... | 1.78 | 95.25 | 30.13 | No | No | Yes | Yes | Yes, received tetanus shot but not sure what type | No | No |
| 2 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | 6 or more, but not all | No | ... | 1.85 | 108.86 | 31.66 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
| 3 | Alabama | Female | Fair | 5.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | ... | 1.70 | 90.72 | 31.32 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
| 4 | Alabama | Female | Good | 3.0 | 15.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | 1 to 5 | No | ... | 1.55 | 79.38 | 33.07 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 246017 | Virgin Islands | Male | Very good | 0.0 | 0.0 | Within past 2 years (1 year but less than 2 ye... | Yes | 6.0 | None of them | No | ... | 1.78 | 102.06 | 32.28 | Yes | No | No | No | Yes, received tetanus shot but not sure what type | No | No |
| 246018 | Virgin Islands | Female | Fair | 0.0 | 7.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | None of them | No | ... | 1.93 | 90.72 | 24.34 | No | No | No | No | No, did not receive any tetanus shot in the pa... | No | Yes |
| 246019 | Virgin Islands | Male | Good | 0.0 | 15.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | 1 to 5 | No | ... | 1.68 | 83.91 | 29.86 | Yes | Yes | Yes | Yes | Yes, received tetanus shot but not sure what type | No | Yes |
| 246020 | Virgin Islands | Female | Excellent | 2.0 | 2.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | None of them | No | ... | 1.70 | 83.01 | 28.66 | No | Yes | Yes | No | Yes, received tetanus shot but not sure what type | No | No |
| 246021 | Virgin Islands | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 5.0 | None of them | Yes | ... | 1.83 | 108.86 | 32.55 | No | Yes | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
246022 rows × 40 columns
print(data['AgeCategory'])
0 Age 65 to 69
1 Age 70 to 74
2 Age 75 to 79
3 Age 80 or older
4 Age 80 or older
...
246017 Age 60 to 64
246018 Age 25 to 29
246019 Age 65 to 69
246020 Age 50 to 54
246021 Age 70 to 74
Name: AgeCategory, Length: 246022, dtype: object
# Encode AgeCategory.
# Rather than assigning an arbitrary identifier to each age group,
# we map each bracket to its approximate midpoint (80 for the open-ended top bracket),
# which preserves the natural ordering of the groups.
encode_AgeCategory = {
'Age 18 to 24': 21,
'Age 25 to 29': 27,
'Age 30 to 34': 32,
'Age 35 to 39': 37,
'Age 40 to 44': 42,
'Age 45 to 49': 47,
'Age 50 to 54': 52,
'Age 55 to 59': 57,
'Age 60 to 64': 62,
'Age 65 to 69': 67,
'Age 70 to 74': 72,
'Age 75 to 79': 77,
'Age 80 or older': 80
}
data['Age_Category_Avg'] = data['AgeCategory'].map(encode_AgeCategory)
#data.to_csv('heart_2022_no_nans.csv', index=False)
#data_2.to_csv('modified_data.csv', index=False)
print(data['AgeCategory'])
0 Age 65 to 69
1 Age 70 to 74
2 Age 75 to 79
3 Age 80 or older
4 Age 80 or older
...
246017 Age 60 to 64
246018 Age 25 to 29
246019 Age 65 to 69
246020 Age 50 to 54
246021 Age 70 to 74
Name: AgeCategory, Length: 246022, dtype: object
print(data['Age_Category_Avg'])
0 67
1 72
2 77
3 80
4 80
..
246017 62
246018 27
246019 67
246020 52
246021 72
Name: Age_Category_Avg, Length: 246022, dtype: int64
print(data['AgeCategory'].dtype)
object
data.describe()
| | PhysicalHealthDays | MentalHealthDays | SleepHours | HeightInMeters | WeightInKilograms | BMI | Age_Category_Avg |
|---|---|---|---|---|---|---|---|
| count | 246022.000000 | 246022.000000 | 246022.000000 | 246022.000000 | 246022.000000 | 246022.000000 | 246022.000000 |
| mean | 4.119026 | 4.167140 | 7.021331 | 1.705150 | 83.615179 | 28.668136 | 55.392262 |
| std | 8.405844 | 8.102687 | 1.440681 | 0.106654 | 21.323156 | 6.513973 | 17.218703 |
| min | 0.000000 | 0.000000 | 1.000000 | 0.910000 | 28.120000 | 12.020000 | 21.000000 |
| 25% | 0.000000 | 0.000000 | 6.000000 | 1.630000 | 68.040000 | 24.270000 | 42.000000 |
| 50% | 0.000000 | 0.000000 | 7.000000 | 1.700000 | 81.650000 | 27.460000 | 57.000000 |
| 75% | 3.000000 | 4.000000 | 8.000000 | 1.780000 | 95.250000 | 31.890000 | 72.000000 |
| max | 30.000000 | 30.000000 | 24.000000 | 2.410000 | 292.570000 | 97.650000 | 80.000000 |
# Check whether any null values remain (even though the file's author states it is already cleaned)
data.isnull().sum()
State                        0
Sex                          0
GeneralHealth                0
PhysicalHealthDays           0
MentalHealthDays             0
LastCheckupTime              0
PhysicalActivities           0
SleepHours                   0
RemovedTeeth                 0
HadHeartAttack               0
HadAngina                    0
HadStroke                    0
HadAsthma                    0
HadSkinCancer                0
HadCOPD                      0
HadDepressiveDisorder        0
HadKidneyDisease             0
HadArthritis                 0
HadDiabetes                  0
DeafOrHardOfHearing          0
BlindOrVisionDifficulty      0
DifficultyConcentrating      0
DifficultyWalking            0
DifficultyDressingBathing    0
DifficultyErrands            0
SmokerStatus                 0
ECigaretteUsage              0
ChestScan                    0
RaceEthnicityCategory        0
AgeCategory                  0
HeightInMeters               0
WeightInKilograms            0
BMI                          0
AlcoholDrinkers              0
HIVTesting                   0
FluVaxLast12                 0
PneumoVaxEver                0
TetanusLast10Tdap            0
HighRiskLastYear             0
CovidPos                     0
Age_Category_Avg             0
dtype: int64
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 41 columns), all 246022 non-null
dtypes: float64(6), int64(1), object(34)
memory usage: 77.0+ MB
data.shape
(246022, 41)
data.value_counts('HadHeartAttack')
HadHeartAttack
No     232587
Yes     13435
Name: count, dtype: int64
A few observations at first glance:
The dataset is imbalanced, with an uneven distribution across the target categories (examples and plots are provided below).
Features such as "HadStroke", "AgeCategory", "DifficultyWalking", and possibly "HadDiabetes" appear to have a stronger association with the target variable ("HadHeartAttack").
"Sex" and "RaceEthnicityCategory" exhibit lower correlation values, indicating a weaker direct relationship with heart attacks in this dataset; these features are candidates for dropping later.
# Encode 'Sex' column
data_check = data.copy()
data_check['Sex'] = data_check['Sex'].map({'Female': 0, 'Male': 1})
# Encode 'RaceEthnicityCategory' column
data_check['RaceEthnicityCategory'] = data_check['RaceEthnicityCategory'].map({
'White only, Non-Hispanic': 0,
'Black only, Non-Hispanic': 1,
'Other race only, Non-Hispanic': 2,
'Multiracial, Non-Hispanic': 3,
'Hispanic': 4
})
# Encode 'HadHeartAttack' column
data_check['HadHeartAttack'] = data_check['HadHeartAttack'].map({'Yes': 1, 'No': 0})
# Create a correlation matrix
corr_matrix = data_check[['Sex', 'RaceEthnicityCategory', 'HadHeartAttack']].corr()
# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"size": 12})
plt.title('Correlation Heatmap')
plt.show()
sns.countplot(x="HadHeartAttack", data=data)
plt.title("Distribution of Heart Attack")
plt.xlabel("Had Heart Attack")
plt.ylabel("Count")
plt.xticks(ticks=[0, 1], labels=["No", "Yes"]) # Rename the x-axis tick labels
plt.show()
plt.figure(figsize=(12,12))
sns.boxplot(data=data)
plt.title('Boxplots of Numerical Features')
plt.show()
cat_data=data.select_dtypes(include='object')
num_data=data.select_dtypes(exclude='object')
#categorical features: ['HeartDisease', 'Smoking', 'AlcoholDrinking', 'Stroke',
#'DiffWalking', 'Sex', 'Race', 'Diabetic', 'PhysicalActivity',
#'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer']
#numerical features: ['BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'SleepTime']
print("categorical features: ", cat_data.columns.to_list())
print("numerical features: ", num_data.columns.to_list())
data.head()
categorical features:  ['State', 'Sex', 'GeneralHealth', 'LastCheckupTime', 'PhysicalActivities', 'RemovedTeeth', 'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty', 'DifficultyConcentrating', 'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands', 'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory', 'AgeCategory', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12', 'PneumoVaxEver', 'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos']
numerical features:  ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'WeightInKilograms', 'BMI', 'Age_Category_Avg']
| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | Age_Category_Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Very good | 4.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | ... | 71.67 | 27.99 | No | No | Yes | Yes | Yes, received Tdap | No | No | 67 |
| 1 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 6.0 | None of them | No | ... | 95.25 | 30.13 | No | No | Yes | Yes | Yes, received tetanus shot but not sure what type | No | No | 72 |
| 2 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | 6 or more, but not all | No | ... | 108.86 | 31.66 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | Yes | 77 |
| 3 | Alabama | Female | Fair | 5.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | ... | 90.72 | 31.32 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | Yes | 80 |
| 4 | Alabama | Female | Good | 3.0 | 15.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | 1 to 5 | No | ... | 79.38 | 33.07 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No | 80 |
5 rows × 41 columns
for c in cat_data:
plt.rcParams['figure.figsize'] = (20, 8)
sns.countplot(x=c, hue='HadHeartAttack', data=data)
plt.title(f'Heart Disease Count Grouped by {c} Status')
plt.xlabel(c)
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
sns.heatmap(num_data.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
data_check = data.copy()
# Encode 'HadHeartAttack' column
data_check['HadHeartAttack'] = data_check['HadHeartAttack'].map({'Yes': 1, 'No': 0})
# Select numerical columns (copy to avoid a SettingWithCopyWarning on assignment)
num_data = data.select_dtypes(exclude='object').copy()
num_data['HadHeartAttack'] = data_check['HadHeartAttack']  # include the encoded target in the numerical data
# Save as data_num_cat_check
data_num_cat_check = num_data
# Build Correlation Heatmap
corr_matrix = data_num_cat_check.corr()
# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"size": 10})
plt.title('Correlation Heatmap')
plt.show()
sns.pairplot(data,height=2)
plt.show()
disease=['HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD',
'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis', 'HadDiabetes']
for d in disease:
df_filtered = data[data[d] == 'Yes']
if not df_filtered.empty:
plt.figure(figsize=(20,10))
sns.countplot(x='HadHeartAttack', data=df_filtered)
plt.title(f'Heart Disease Count among Patients with {d}')
plt.xlabel('HadHeartAttack')
plt.ylabel('Count')
plt.show()
for feature in num_data:
    plt.figure(figsize=(10, 6))
    sns.histplot(data=data, x=feature, hue='HadHeartAttack', kde=True, element='step', stat='count')
    plt.title(f'Histogram of {feature} with Heart Attack Overlay')
    plt.xlabel(feature)
    plt.ylabel('Count')
    # sns.histplot already draws the hue legend; a bare plt.legend() finds no labeled artists
    plt.show()
sorted_age_categories = sorted(data['AgeCategory'].unique())
sns.scatterplot(data=data, x='BMI', y='AgeCategory', hue='HadHeartAttack')
# Set the title, labels, and legend
plt.title('Scatter Plot of BMI vs Age by Heart Attack Status')
plt.xlabel('BMI')
plt.ylabel('Age')
plt.yticks(ticks=range(len(sorted_age_categories)), labels=sorted_age_categories) # Set y-axis tick labels
plt.legend(title='Heart Attack')
# Show the plot
plt.show()
#num_data = num_data.drop(columns=['HadHeartAttack'])
num_data.hist(figsize=(16, 20), bins=40, xlabelsize=6, ylabelsize=6);
age_heart_disease = data.groupby('AgeCategory')['HadHeartAttack'].value_counts().unstack().fillna(0)
age_heart_disease.plot(kind='bar', stacked=True)
plt.title('Number of People with Heart Disease by Age Category')
plt.xlabel('Age Category')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Heart Attack', labels=['No', 'Yes'])
plt.tight_layout()
plt.show()
gender_heart_attack = data.groupby('Sex')['HadHeartAttack'].value_counts().unstack().fillna(0)
gender_heart_attack.plot(kind='bar', stacked=True)
plt.title('Number of People with Heart Attack by Sex Category')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Heart Attack', labels=['No', 'Yes'])  # legend entries are HadHeartAttack values, not Sex
plt.tight_layout()
plt.show()
I will encode categorical features using LabelEncoder.
I will also apply RobustScaler to reduce the influence of skewed outliers. Note: no outliers were removed, since they occur in significant numbers and are important for this analysis.
I will use SMOTE oversampling to handle the class imbalance.
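One caveat worth noting before encoding: LabelEncoder assigns codes alphabetically, which can scramble ordered categories such as GeneralHealth. A minimal sketch (toy Series; the category names come from this dataset, everything else is illustrative) comparing alphabetical codes with an explicit ordinal map:

```python
# Sketch: alphabetical vs explicit ordinal encoding of an ordered category.
import pandas as pd

s = pd.Series(['Poor', 'Fair', 'Good', 'Very good', 'Excellent', 'Good'])

# LabelEncoder-style alphabetical codes:
# Excellent=0, Fair=1, Good=2, Poor=3, Very good=4 -- the ordering is scrambled
alpha_codes = s.astype('category').cat.codes

# Explicit ordinal map preserving 'Poor' < 'Fair' < 'Good' < 'Very good' < 'Excellent'
order = {'Poor': 0, 'Fair': 1, 'Good': 2, 'Very good': 3, 'Excellent': 4}
ordinal_codes = s.map(order)

print(list(alpha_codes), list(ordinal_codes))
```

Tree-based models are fairly robust to arbitrary integer codes, but distance-based models such as KNN benefit from an encoding that respects the category order.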
from sklearn import preprocessing
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import sklearn
print(sklearn.__version__)
1.2.2
label_encoder = preprocessing.LabelEncoder()
for c in cat_data:
    data[c] = label_encoder.fit_transform(data[c])
data.head()
| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | Age_Category_Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 4 | 4.0 | 0.0 | 3 | 1 | 9.0 | 3 | 0 | ... | 71.67 | 27.99 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 67 |
| 1 | 0 | 1 | 4 | 0.0 | 0.0 | 3 | 1 | 6.0 | 3 | 0 | ... | 95.25 | 30.13 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 72 |
| 2 | 0 | 1 | 4 | 0.0 | 0.0 | 3 | 0 | 8.0 | 1 | 0 | ... | 108.86 | 31.66 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 77 |
| 3 | 0 | 0 | 1 | 5.0 | 0.0 | 3 | 1 | 9.0 | 3 | 0 | ... | 90.72 | 31.32 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 80 |
| 4 | 0 | 0 | 2 | 3.0 | 15.0 | 3 | 1 | 5.0 | 0 | 0 | ... | 79.38 | 33.07 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 80 |
5 rows × 41 columns
scaler=RobustScaler()
scaled_data=scaler.fit_transform(data)
sns.boxplot(data=scaled_data)
plt.show()
race_groups = data.groupby('RaceEthnicityCategory')['HadHeartAttack'].value_counts(normalize=True).unstack(fill_value=0)
race_groups['Ratio'] = race_groups[1] / race_groups[0]
print(race_groups[['Ratio']])
HadHeartAttack            Ratio
RaceEthnicityCategory
0                      0.048208
1                      0.039565
2                      0.064873
3                      0.050887
4                      0.061260
Since the heart-attack ratio differs only slightly across race/ethnicity categories (roughly 0.04 to 0.065), race does not appear to have a significant effect on heart attacks in this dataset.
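A ratio comparison is informal; a chi-square test of independence is a more rigorous way to check whether heart-attack rate depends on a categorical feature. A hedged sketch with purely illustrative counts (not the real contingency table, which `pd.crosstab(data['RaceEthnicityCategory'], data['HadHeartAttack'])` would give):

```python
# Sketch: chi-square test of independence on a made-up 3x2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

# rows = categories, cols = [No heart attack, Heart attack] -- illustrative counts
table = np.array([[9500, 460],
                  [9700, 390],
                  [9400, 610]])

chi2, p, dof, expected = chi2_contingency(table)
print(f'chi2={chi2:.2f}, p={p:.4f}, dof={dof}')
```

A small p-value would indicate the heart-attack rate is not independent of the category; with a sample this large even tiny rate differences can be statistically significant, so effect size still matters.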
data.drop(columns=['RaceEthnicityCategory'], inplace=True)
data.head()
| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos | Age_Category_Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 4 | 4.0 | 0.0 | 3 | 1 | 9.0 | 3 | 0 | ... | 71.67 | 27.99 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 67 |
| 1 | 0 | 1 | 4 | 0.0 | 0.0 | 3 | 1 | 6.0 | 3 | 0 | ... | 95.25 | 30.13 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 72 |
| 2 | 0 | 1 | 4 | 0.0 | 0.0 | 3 | 0 | 8.0 | 1 | 0 | ... | 108.86 | 31.66 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 77 |
| 3 | 0 | 0 | 1 | 5.0 | 0.0 | 3 | 1 | 9.0 | 3 | 0 | ... | 90.72 | 31.32 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 80 |
| 4 | 0 | 0 | 2 | 3.0 | 15.0 | 3 | 1 | 5.0 | 0 | 0 | ... | 79.38 | 33.07 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 80 |
5 rows × 40 columns
data.drop(columns=['State'], inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 39 columns), all 246022 non-null
dtypes: float64(6), int64(33)
memory usage: 73.2 MB
y = data['HadHeartAttack']
X = data.drop('HadHeartAttack', axis=1)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
x_train,x_test,y_train,y_test = train_test_split(X_resampled,y_resampled,test_size=0.30,random_state=42)
x_train_non,x_test_non,y_train_non,y_test_non = train_test_split(X,y,test_size=0.30,random_state=42)
Models to evaluate:
Decision Trees
Random Forest
K-Nearest Neighbors (KNN)
# RobustScaler, train_test_split, and SMOTE are already imported above
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (roc_curve, roc_auc_score, classification_report,
                             confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)
decision_tree = DecisionTreeClassifier(random_state=42)
knn = KNeighborsClassifier()
rf=RandomForestClassifier(n_estimators=77, max_depth=None, random_state=42, n_jobs=-1)
def train_without_kfolds(classifier, x_train, y_train, x_test, y_test):
    classifier.fit(x_train, y_train)
    prediction = classifier.predict(x_test)
    predicted_proba = classifier.predict_proba(x_test)[:, 1]
    return prediction, predicted_proba

def evaluating_model(model, y_test, y_pred, predicted_proba, step_factor=0.1, threshold=0):
    roc_score = 0
    thrsh_score = threshold                      # best threshold found so far
    threshold_value = threshold
    Original_AUC = roc_auc_score(y_test, y_pred)  # AUC of the hard 0/1 predictions
    # Sweep the probability threshold to find the value that maximizes ROC AUC
    while threshold_value <= 1:
        predicted = (predicted_proba >= threshold_value).astype('int')
        current_roc_score = roc_auc_score(y_test, predicted)
        print('Threshold', threshold_value, '--', current_roc_score)
        if roc_score < current_roc_score:
            roc_score = current_roc_score
            thrsh_score = threshold_value
        threshold_value += step_factor
    print('---Optimum Threshold ---', thrsh_score, '--ROC--', roc_score)
    false_positive_rate1, true_positive_rate1, threshold1 = roc_curve(y_test, predicted_proba)
    plt.subplots(1, figsize=(10, 10))
    plt.title(f'Receiver Operating Characteristic - {model}')
    plt.plot(false_positive_rate1, true_positive_rate1)
    plt.plot([0, 1], ls="--")
    plt.plot([0, 0], [1, 0], c=".7"), plt.plot([1, 1], c=".7")
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    print(f"Test ROC AUC Score: {Original_AUC:.2%}")
    print(classification_report(y_test, y_pred))
    print('----Different scores----')
    print(f'Accuracy_score: {accuracy_score(y_test, y_pred)}')
    print(f'Precision_score: {precision_score(y_test, y_pred)}')
    print(f'Recall_score: {recall_score(y_test, y_pred)}')
    print(f'F1-score: {f1_score(y_test, y_pred)}')
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
Testing each model to select the best one for prediction.
Metrics: ROC AUC score and curve, recall, F1, accuracy, and precision.
We opted not to use StratifiedKFold, since the class imbalance was already handled through SMOTE.
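For reference, if a cross-validated check were ever wanted, StratifiedKFold preserves the class ratio in every fold even without resampling. A minimal sketch on synthetic data (not this notebook's dataset):

```python
# Sketch: stratified 5-fold cross-validation of a decision tree by ROC AUC.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (rng.random(300) < 0.2).astype(int)         # imbalanced target, ~20% positives

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y,
                         cv=cv, scoring='roc_auc')
print(scores.mean())
```

Stratification guarantees each fold sees roughly the same minority-class proportion, which keeps per-fold ROC AUC estimates stable on imbalanced data.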
prediction,predicted_proba= train_without_kfolds(decision_tree,x_train, y_train, x_test, y_test)
evaluating_model("Decision Tree",y_test,prediction,predicted_proba)
Threshold 0 -- 0.5
Threshold 0.1 -- 0.9161592787417638
Threshold 0.2 -- 0.9161592787417638
Threshold 0.30000000000000004 -- 0.9161592787417638
Threshold 0.4 -- 0.9161592787417638
Threshold 0.5 -- 0.9161592787417638
Threshold 0.6 -- 0.9161735836655187
Threshold 0.7 -- 0.9161735836655187
Threshold 0.7999999999999999 -- 0.9161735836655187
Threshold 0.8999999999999999 -- 0.9161735836655187
Threshold 0.9999999999999999 -- 0.9161735836655187
---Optimum Threshold --- 0.6 --ROC-- 0.9161735836655187
Test ROC AUC Score: 91.62%
precision recall f1-score support
0 0.93 0.90 0.91 69906
1 0.90 0.93 0.92 69647
accuracy 0.92 139553
macro avg 0.92 0.92 0.92 139553
weighted avg 0.92 0.92 0.92 139553
----Different scores----
Accuracy_score: 0.9161393879028039
Precision_score: 0.9010520487264674
Recall_score: 0.9345987623300358
F1-score: 0.9175188706505882
prediction,predicted_proba= train_without_kfolds(knn,x_train, y_train, x_test, y_test)
evaluating_model("KNN",y_test,prediction,predicted_proba)
Threshold 0 -- 0.5
Threshold 0.1 -- 0.7924067868815349
Threshold 0.2 -- 0.7924067868815349
Threshold 0.30000000000000004 -- 0.8359218595769424
Threshold 0.4 -- 0.8359218595769424
Threshold 0.5 -- 0.8765540645639255
Threshold 0.6 -- 0.8765540645639255
Threshold 0.7 -- 0.9174526370612743
Threshold 0.7999999999999999 -- 0.9174526370612743
Threshold 0.8999999999999999 -- 0.9487201824645349
Threshold 0.9999999999999999 -- 0.9487201824645349
---Optimum Threshold --- 0.8999999999999999 --ROC-- 0.9487201824645349
Test ROC AUC Score: 87.66%
precision recall f1-score support
0 1.00 0.75 0.86 69906
1 0.80 1.00 0.89 69647
accuracy 0.88 139553
macro avg 0.90 0.88 0.87 139553
weighted avg 0.90 0.88 0.87 139553
----Different scores----
Accuracy_score: 0.8763265569353579
Precision_score: 0.8018228746572028
Recall_score: 0.999138512785906
F1-score: 0.889671616602635
prediction,predicted_proba= train_without_kfolds(rf,x_train, y_train, x_test, y_test)
evaluating_model("Random Forest",y_test,prediction,predicted_proba)
Threshold 0 -- 0.5
Threshold 0.1 -- 0.7962440681472773
Threshold 0.2 -- 0.8827367656994646
Threshold 0.30000000000000004 -- 0.9254163519360591
Threshold 0.4 -- 0.945370704911418
Threshold 0.5 -- 0.9562530473393123
Threshold 0.6 -- 0.9576023234778211
Threshold 0.7 -- 0.9513614519912462
Threshold 0.7999999999999999 -- 0.9312936067902051
Threshold 0.8999999999999999 -- 0.8752163964519417
Threshold 0.9999999999999999 -- 0.6484629632288541
---Optimum Threshold --- 0.6 --ROC-- 0.9576023234778211
Test ROC AUC Score: 95.63%
precision recall f1-score support
0 0.96 0.95 0.96 69906
1 0.95 0.96 0.96 69647
accuracy 0.96 139553
macro avg 0.96 0.96 0.96 139553
weighted avg 0.96 0.96 0.96 139553
----Different scores----
Accuracy_score: 0.9562388483228594
Precision_score: 0.9491983146226282
Recall_score: 0.9639036857294643
F1-score: 0.9564944825571868